Increasing Skew Insensitivity of Decision Trees with Hellinger Distance

Authors

  • David A. Cieslak
  • Nitesh V. Chawla
Abstract

Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms typically perform poorly. The heuristics used in learning tend to favor the larger, less important classes in such problems. While other methods, such as sampling, have been introduced to combat imbalance, these tend to be computationally expensive. This paper proposes Hellinger distance as a means to induce decision trees. We demonstrate that Hellinger distance is skew insensitive, especially compared to information gain. This leads to the Hellinger Distance Decision Tree (HDDT) induction algorithm, and we demonstrate significant performance improvements on problems within the realm of class imbalance.
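For reference, a sketch of the splitting criterion that HDDT maximizes, written for a two-class problem in the notation commonly used in the HDDT literature (the notation is assumed here, not quoted from this abstract): X_+ and X_- denote the positive and negative training examples at a node, X_{+j} and X_{-j} their restrictions to branch j of a candidate split with p branches, and

d_H(X_+, X_-) = \sqrt{\sum_{j=1}^{p}\left(\sqrt{\frac{|X_{+j}|}{|X_+|}} - \sqrt{\frac{|X_{-j}|}{|X_-|}}\right)^{2}}

Because each class is normalized by its own size, the criterion never touches the class priors, which is the source of the skew insensitivity claimed above; the tree is grown by choosing the split that maximizes d_H, just as a standard tree chooses the split that maximizes information gain.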


Similar articles

Learning Decision Trees for Unbalanced Data

Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria – information gain, Gini measure, and DKM – and i...

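To make the comparison with information gain and the Gini measure concrete, here is a minimal Python sketch (illustrative code written for this summary, not taken from either paper; all function names are ours). It scores one fixed two-branch split, in which each branch captures 90% of one class, under growing class skew: the impurity-based scores shrink as negatives are added, while the Hellinger distance of the very same split does not move.

import math

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_decrease(branches, impurity):
    # Parent impurity minus the size-weighted impurity of the branches
    # (with `entropy` this is information gain, with `gini` the Gini decrease).
    parent = [y for branch in branches for y in branch]
    n = len(parent)
    return impurity(parent) - sum(len(b) / n * impurity(b) for b in branches)

def hellinger(branches):
    # Hellinger distance between the class-conditional branch distributions
    # (the HDDT splitting criterion for a two-class problem; labels are 1/0).
    n_pos = sum(b.count(1) for b in branches)
    n_neg = sum(b.count(0) for b in branches)
    return math.sqrt(sum((math.sqrt(b.count(1) / n_pos) -
                          math.sqrt(b.count(0) / n_neg)) ** 2
                         for b in branches))

# One fixed split in which each branch captures 90% of "its" class,
# evaluated with 100 positives against 100, 1,000 and 10,000 negatives.
for n_neg in (100, 1000, 10000):
    left = [1] * 90 + [0] * (n_neg // 10)           # mostly positive branch
    right = [1] * 10 + [0] * (n_neg - n_neg // 10)  # mostly negative branch
    print("%5d:100  info gain=%.3f  gini decrease=%.3f  hellinger=%.3f" % (
        n_neg,
        impurity_decrease([left, right], entropy),
        impurity_decrease([left, right], gini),
        hellinger([left, right])))

The two impurity-based scores fall sharply as the negative class grows, while the Hellinger distance of the same split stays at roughly 0.894 (its maximum is sqrt(2)), which is the skew insensitivity that motivates HDDT.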

Using HDDT to avoid instance propagation in unbalanced and evolving data streams

Hellinger distance has been successfully used as a tree splitting criterion in Hellinger Distance Decision Trees [10] (HDDT) for unbalanced static datasets. In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees to cope with the imbalance problem. However, it is not always possible to revisit or store old instances of a stream. We solve this pr...


The Hellinger distance in Multicriteria Decision Making: An illustration to the TOPSIS and TODIM methods

Due to the difficulty, in some situations, of expressing the ratings of alternatives as exact real numbers, many well-known methods that support Multicriteria Decision Making (MCDM) have been extended to compute with many types of information. This paper focuses on information represented as probability distributions. Many of the methods that deal with probability distributions use the concept of...


Building Decision Trees for the Multi-class Imbalance Problem

Learning in imbalanced datasets is a pervasive problem in a wide variety of real-world applications. In imbalanced datasets, the class of interest is generally a small fraction of the total instances, but misclassification of such instances is often expensive. While there is a significant body of research on the class imbalance problem for binary-class datasets, multi-class datasets h...


Probability-possibility DEA model with Fuzzy random data in presence of skew-Normal distribution

Data envelopment analysis (DEA) is a mathematical method to evaluate the performance of decision-making units (DMUs). In the performance evaluation of an organization based on the classical theory of DEA, input and output data are assumed to be deterministic, while in the real world the observed values of the input and output data are mainly fuzzy and random. A normal distribution is a contin...

Publication year: 2008